Last updated: 2017-03-05

Notes

Readings

Bolker et al. ( ???): Generalized linear mixed models: a practical guide for ecology and evolution

Additional resources

Approachable:

  • Zuur et al. ( ???): Mixed Effects Models and Extensions in Ecology with R. But takes a very old fashioned view of model selection.

Bayesian:

  • McElreath ( ???): Statistical Rethinking: A Bayesian Course with Examples in R and Stan

MCMC:

  • Hadfield ( ???): MCMC Methods for Multi-response Generalized Linear Mixed Models: The MCMCglmm R Package

Technical:

  • Pinheiro and Bates ( ???): Mixed-effects models in S and S-PLUS

Goals for today

Very Important:

  • Understand the nature of the problems we are trying to solve
  • Understand the value in solving them

Moderately Important:

  • Understand some of the technical aspects of multilevel model fitting and interpretation

Minimally Important:

  • Understand the math involved

Assumptions of linear regression

  1. Normality (check pooled residuals)
  2. Homogeneity (check pooled residuals, transformation?)
  3. Fixed explanatory variables (plan sampling)
  4. Independence (observations often correlated)
  5. Correct model

Non-independence: The main challenge

Our tests (so far) assume independence of samples.

Lots of data have structure at additional levels beyond the main variable(s) of interest

  • Error in these levels is correlated (making the data non-independent)

We need a method to model that correlation structure.

Correlated data

Explicitly by design:

  • Grouping (leaves within plants within treatments within fields)
  • Repeated measurements of the same individual (time or space)

Implicitly as a side effect:

  • Non-equal sample sizes, missing data
  • Non-equal within-group variance
  • Phylogenetic structure ( ???)
  • Population structure and genetic relatedness ( ???)

Some terminology

Multilevel / hierarchical / mixed / repeated measures models

  • All refer to the same class of models, which contain both fixed and random variables.
  • Some will further subdivide and draw fine distinctions between models.

Some terminology

Fixed effects:

  • Explanatory variables of primary interest, which would be used again if the experiment were repeated
  • Sex, treatment groups, etc.

Some terminology

Random effects

  • Variables that are a "random" selection from a larger set of variables that allow for heterogeneity, correlation, phylogenetic or genetic structure, real random variation (\(\epsilon\)).
  • Subject, block, etc.
  • "Random" does not necessarily mean "not of interest."

Also see: http://andrewgelman.com/2005/01/25/why_i_dont_use/

Fundamental idea of multilevel models

Units are grouped at different levels.

  • Grouping takes different forms

General multilevel model formula

\[Y = \beta_0 + \beta_1 X_1 + ... + \beta_n X_n + \mbox{Random}\]

Where \(\mbox{Random}\) can be:

  • Random noise (\(\epsilon\))
  • Heterogeneity of variance
  • Nesting structure
  • Temporal correlation (autocorrelation)
  • Spatial correlation
  • Phylogenetic correlation
  • Genetic correlation

Turning flight in hummingbirds

Turning flight in hummingbirds

Turning flight in hummingbirds

Turning flight in hummingbirds

Turning flight in hummingbirds

Turning flight in hummingbirds

Turning flight in hummingbirds

Hierarchical structure

  • Maneuver (hover vs. clockwise, hover vs. counterclockwise)
    • Bird (4)
      • Trial (5)
        • Wingbeat (mean of 16 \(\rightarrow\) 1)
  1. Maneuver is the variable of interest ("fixed")
  2. Birds are a representative sample of all birds ("random")
  3. Trials are repeated measures of each bird ("random")
  4. Wingbeats are the observation level

What we did

  1. Calculate means of 16 wingbeats for each bird for each trial
    • Loss of information, but…
  2. Five trials per bird
  3. Multilevel ANOVA
    • Bird and trial nested within bird as random effects
    • Maneuver as the fixed effect

Each bird gets its own mean value for a maneuver (across 5 trials). Maneuver is analyzed as the mean difference between each maneuver.

Challenges to multilevel models

Conceptual complexity

  • You must know your model
  • Difficult to visualize
  • Do my results make sense?

Computational complexity

  • Many methods fit by likelihood
    • Singularities abound, differentiation doesn't always work
  • MCMC can be slow, difficult to tune
  • Bayesian requires defining priors

Fitting multilevel models

Among the options for R packages:

  • nlme: Oldest, restricted to a subset of models (linear and non-linear Gaussian), for us the place to start
  • lme4: Newer (but same authors as nlme), can handle more complex models with crossed random effects, pedigrees, non-Gaussian response, etc., more flexible distributions of response variables, but harder to test hypotheses
  • MCMCglmm, rstan, rjags: Fit Bayesian multilevel models via a variety of MCMC samplers

You have already fit several multilevel models

Intraclass correlation coefficient:

  • Repeated measurements for a single specimen
  • What is the relative among measurement vs. between specimen variation

Intraclass correlation coefficient

Intraclass correlation coefficient

Lengths <- data.frame(Fem_Len = c(M$Len1, M$Len2),
                      MouseID = factor(c(M$MouseID, M$MouseID)))
Lengths <- Lengths %>% drop_na()
fm <- anova(lm(Fem_Len ~ MouseID, data = Lengths))
print(fm, digits = 5)
## Analysis of Variance Table
## 
## Response: Fem_Len
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## MouseID   118 75.023 0.63579  3510.8 < 2.2e-16 ***
## Residuals 119  0.022 0.00018                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
var_a <- (fm$'Mean Sq'[1] - fm$'Mean Sq'[2]) / 2
var_a / (var_a + fm$'Mean Sq'[2])
## [1] 0.9994305

ICC with lme()

library(nlme)
fm_lme <- lme(Fem_Len ~ 1,
              random = ~ 1 | MouseID,
              data = Lengths,
              method = "ML")
head(coef(fm_lme))
##       (Intercept)
## 46001    16.21499
## 46002    17.08474
## 46003    16.77983
## 46004    16.64986
## 46012    16.85481
## 46013    15.99005

ICC with lme()

Lengths %>% group_by(MouseID) %>% summarize(mean(Fem_Len)) %>% head()
## # A tibble: 6 × 2
##   MouseID `mean(Fem_Len)`
##    <fctr>           <dbl>
## 1   46001          16.215
## 2   46002          17.085
## 3   46003          16.780
## 4   46004          16.650
## 5   46012          16.855
## 6   46013          15.990

ICC with lme()

(varcomps <- VarCorr(fm_lme))
## MouseID = pdLogChol(1) 
##             Variance     StdDev    
## (Intercept) 0.3151310080 0.56136531
## Residual    0.0001810924 0.01345706
var.among <- as.numeric(varcomps[1, 1])
var.within <- as.numeric(varcomps[2, 1])
var.among / (var.among + var.within)
## [1] 0.9994257

Note: very close but not numerically equivalent to the anova() version. Models are slightly different.

Loligo forbesii

Seasonal patterns in testis mass in the squid Loligo. Testis mass predicted by dorsal mantle length, month, and DML X month.

Squid data

S <- read_excel("../data/Squid.xlsx")
S <- S %>% mutate(Month = factor(Month))
str(S)
## Classes 'tbl_df', 'tbl' and 'data.frame':    768 obs. of  5 variables:
##  $ Specimen   : num  1017 1034 1070 1070 1019 ...
##  $ Year       : num  1991 1990 1990 1990 1990 ...
##  $ Month      : Factor w/ 12 levels "1","2","3","4",..: 2 9 12 11 8 10 5 7 7 7 ...
##  $ DML        : num  136 144 108 130 121 117 133 105 109 97 ...
##  $ Testis_Mass: num  0.006 0.008 0.008 0.011 0.012 0.012 0.013 0.015 0.017 0.017 ...
# Check for missing data
sum(!complete.cases(S))
## [1] 0

Visualizing

Visualizing

Multiple regression

Testis mass modeled by DML, Month, and the DML X Month interaction:

fm1 <- lm(Testis_Mass ~ DML + Month + DML:Month, data = S)

Extract the residuals and fitted values:

S$Res1 <- residuals(fm1)
S$Fitted <- fitted.values(fm1)

Residuals vs. Fitted Values

Residuals vs. Month

Dealing with heterogeneity within groups

What we need:

  1. Allow for change in variance with values of DML (here, more spread with larger values)
  2. Allow for difference variances per month

Solution:

  • Generalized least squares \(\approx\) "Weighted regression"
  • nlme::gls()
  • Also one solution for phylogenetic least squares (coming soon)

Fitting a weighted regression

library(nlme)
variance_struct <- varPower(form = ~ DML | Month)
  • This structure allows the residual variance to change with DML on a per-Month basis.
  • Here a power relationship
    • Other variance structures available
fm2 <- gls(Testis_Mass ~ DML * Month, data = S,
           weights = variance_struct)

Diagnostics

plot(fm2)

Summary

library(car)
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
Anova(fm2, type = "III")
## Analysis of Deviance Table (Type III tests)
## 
## Response: Testis_Mass
##             Df   Chisq Pr(>Chisq)    
## (Intercept)  1  12.830  0.0003411 ***
## DML          1  97.686  < 2.2e-16 ***
## Month       11 131.398  < 2.2e-16 ***
## DML:Month   11 263.769  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Visualization of the model

S$Res2 <- residuals(fm2, type = "normalized")

Visualization of the model

Coefficients

##                coef(fm2)
## (Intercept) -4.780826964
## DML          0.046586983
## Month2       1.334893906
## Month3       3.661220332
## Month4       3.410126677
## Month5       3.953042298
## Month6      -0.666390313
## Month7       3.669116396
## Month8       3.241297524
## Month9      -1.646885027
## Month10      0.520272100
## Month11     -1.742198739
## Month12     -0.957170988
## DML:Month2  -0.007741310
## DML:Month3  -0.020683250
## DML:Month4  -0.023433233
## DML:Month5  -0.028570828
## DML:Month6  -0.012384582
## DML:Month7  -0.036971343
## DML:Month8  -0.036261427
## DML:Month9  -0.002745741
## DML:Month10 -0.008962055
## DML:Month11  0.007326031
## DML:Month12  0.007205327

Nested data

Data with hierarchical structure:

  • i samples within n blocks

We could just take a means of each block and use those data for ANOVA.

  • Lose all the information about variability within blocks
  • Might increase apparent variance

Intertidal species richness

Collected by National Institute for Coastal and Marine Management (RIKZ)

Intertidal data

RIKZ <- read_excel("../data/RIKZ.xlsx")
glimpse(RIKZ)
## Observations: 45
## Variables: 5
## $ Sample   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ Richness <dbl> 11, 10, 13, 11, 10, 8, 9, 8, 19, 17, 6, 1, 4, 3, 3, 1...
## $ Exposure <dbl> 10, 10, 10, 10, 10, 8, 8, 8, 8, 8, 11, 11, 11, 11, 11...
## $ NAP      <dbl> 0.045, -1.036, -1.336, 0.616, -0.684, 1.190, 0.820, 0...
## $ Beach    <dbl> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4,...
  • Richness: Number of species
  • NAP: Height above mean tidal level
  • Exposure: Index composed of wave action, length of the surf zone, slope, grain size, and the depth of the anaerobic layer

Preparing data

Convert Beach to factor.

RIKZ <- RIKZ %>% mutate(Beach = factor(Beach))

Sampling structure

Sites are the level of the observation.

Sampling structure

RIKZ %>% group_by(Beach) %>% tally()
## # A tibble: 9 × 2
##    Beach     n
##   <fctr> <int>
## 1      1     5
## 2      2     5
## 3      3     5
## 4      4     5
## 5      5     5
## 6      6     5
## 7      7     5
## 8      8     5
## 9      9     5

5 observations at each of 9 beaches

Visualizing NAP

Visualizing Exposure

Each Beach only has 1 value of Exposure.

Naive multiple regression model

fm1 <- lm(Richness ~ NAP + Beach, data = RIKZ)
Anova(fm1, type = "III")
## Anova Table (Type III tests)
## 
## Response: Richness
##             Sum Sq Df F value    Pr(>F)    
## (Intercept) 466.37  1 49.8039 3.224e-08 ***
## NAP         230.66  1 24.6322 1.793e-05 ***
## Beach       416.37  8  5.5581 0.0001441 ***
## Residuals   327.74 35                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Naive multiple regression model

Maybe we are not interested in knowing the exact nature of the beach effect.

Multilevel model

Allow a different mean richness per beach:

fm2 <- lme(Richness ~ NAP, random = ~ 1 | Beach, data = RIKZ)
  • Similar to the model we used for intraclass correlation coefficient.

Diagnostics

plot(fm2)

Summary

## Linear mixed-effects model fit by REML
##  Data: RIKZ 
##        AIC     BIC    logLik
##   247.4802 254.525 -119.7401
## 
## Random effects:
##  Formula: ~1 | Beach
##         (Intercept) Residual
## StdDev:    2.944065  3.05977
## 
## Fixed effects: Richness ~ NAP 
##                 Value Std.Error DF   t-value p-value
## (Intercept)  6.581893 1.0957618 35  6.006682       0
## NAP         -2.568400 0.4947246 35 -5.191574       0
##  Correlation: 
##     (Intr)
## NAP -0.157
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -1.4227495 -0.4848006 -0.1576462  0.2518966  3.9793918 
## 
## Number of Observations: 45
## Number of Groups: 9

Visualizing the model

Generate predicted values for overall population level estimates and within beach estimates.

# Overall
RIKZ$Pred0 <- fitted(fm2, level = 0)

# Within beach
RIKZ$Pred1 <- fitted(fm2, level = 1)

Visualizing the model

Does the relationship differ per beach?

Allow each Beach to have its own NAP to Richness linear relationship:

fm3 <- lme(Richness ~ NAP, random = ~ NAP | Beach, data = RIKZ)

Now the NAP to Richness relationship can vary for each Beach.

Model formula for random effects:

\[\sim \mbox{Slope}~|~\mbox{Intercept}\]

Summary

summary(fm3)
## Linear mixed-effects model fit by REML
##  Data: RIKZ 
##        AIC      BIC    logLik
##   244.3839 254.9511 -116.1919
## 
## Random effects:
##  Formula: ~NAP | Beach
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev   Corr  
## (Intercept) 3.549064 (Intr)
## NAP         1.714963 -0.99 
## Residual    2.702820       
## 
## Fixed effects: Richness ~ NAP 
##                 Value Std.Error DF   t-value p-value
## (Intercept)  6.588706  1.264761 35  5.209448   0e+00
## NAP         -2.830028  0.722940 35 -3.914610   4e-04
##  Correlation: 
##     (Intr)
## NAP -0.819
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -1.8213326 -0.3411043 -0.1674617  0.1921216  3.0397049 
## 
## Number of Observations: 45
## Number of Groups: 9

Visualizing the model

Generate predicted values for overall population level estimates and within beach estimates.

# Overall
RIKZ$Pred0 <- fitted(fm3, level = 0)

# Within beach
RIKZ$Pred1 <- fitted(fm3, level = 1)

Visualizing the model

Coefficients

##   (Intercept)       NAP
## 1    8.421057 -3.656268
## 2   12.363497 -5.536814
## 3    3.806643 -1.505716
## 4    3.562422 -1.385960
## 5   11.200150 -5.137323
## 6    4.426279 -1.775675
## 7    4.082944 -1.644329
## 8    5.099895 -2.106852
## 9    6.335471 -2.721314

Longitudinal studies

Sequential measurements of the same individual / plot / organism, etc. will result in within-subject correlation.

  • Growth curves
  • Long-term ecological monitoring

Cranial growth rates in children

Distance from the sella turcica to the pterygomaxillary fissure in growing children.

Orthodonics data

O <- read_excel("../data/Ortho.xlsx")
O <- O %>% mutate(Subject = factor(Subject),
                  Sex = factor(Sex))
glimpse(O)
## Observations: 108
## Variables: 4
## $ Distance <dbl> 26.0, 25.0, 29.0, 31.0, 21.5, 22.5, 23.0, 26.5, 23.0,...
## $ Age      <dbl> 8, 10, 12, 14, 8, 10, 12, 14, 8, 10, 12, 14, 8, 10, 1...
## $ Subject  <fctr> M01, M01, M01, M01, M02, M02, M02, M02, M03, M03, M0...
## $ Sex      <fctr> Male, Male, Male, Male, Male, Male, Male, Male, Male...

Orthodonics data

O %>% group_by(Sex) %>% tally()
## # A tibble: 2 × 2
##      Sex     n
##   <fctr> <int>
## 1 Female    44
## 2   Male    64
range(O$Age)
## [1]  8 14

Potthoff, R. F. and Roy, S. N. (1964) A generalized multivariate analysis of variance model useful especially for growth curve problems. Biometrika 51: 313-326.

Visualizing

Visualizing

Do boys and girls grow at different rates?

  • Main effects of Age and Sex (like an ANCOVA model).
  • Random effects allow the Age by Distance relationship (slope) to vary for each Subject, which is also allowed to have its own mean.
fm2 <- lme(Distance ~ Sex + Age, random = ~ Age | Subject,
           method = "ML",
           data = O)

Predicted values

# Overall
O$Pred0 <- fitted(fm2, level = 0)

# Age within Subject
O$Pred1 <- fitted(fm2, level = 1)

Visualizing

Summary

## Linear mixed-effects model fit by maximum likelihood
##  Data: O 
##        AIC      BIC    logLik
##   446.8352 465.6101 -216.4176
## 
## Random effects:
##  Formula: ~Age | Subject
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev    Corr  
## (Intercept) 2.6447350 (Intr)
## Age         0.2149243 -0.76 
## Residual    1.3100397       
## 
## Fixed effects: Distance ~ Sex + Age 
##                 Value Std.Error DF   t-value p-value
## (Intercept) 15.489709 0.9328629 80 16.604486  0.0000
## SexMale      2.145491 0.7391991 25  2.902453  0.0076
## Age          0.660185 0.0709131 80  9.309772  0.0000
##  Correlation: 
##         (Intr) SexMal
## SexMale -0.470       
## Age     -0.792  0.000
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -3.16563246 -0.45463448  0.01446427  0.44559465  3.90045094 
## 
## Number of Observations: 108
## Number of Groups: 27

ANOVA table

Anova(fm2, type = "III")
## Analysis of Deviance Table (Type III tests)
## 
## Response: Distance
##                Chisq Df Pr(>Chisq)    
## (Intercept) 283.5863  1  < 2.2e-16 ***
## Sex           8.6649  1   0.003244 ** 
## Age          89.1482  1  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Intercepts only model

fm1 <- lme(Distance ~ Sex + Age, random = ~ 1 | Subject, 
           method = "ML",
           data = O)

# Overall
O$Pred0 <- fitted(fm1, level = 0)

# Age within Subject
O$Pred1 <- fitted(fm1, level = 1)

Visualizing

Model comparison

anova(fm1, fm2)
##     Model df      AIC      BIC    logLik   Test  L.Ratio p-value
## fm1     1  5 444.8565 458.2671 -217.4282                        
## fm2     2  7 446.8352 465.6101 -216.4176 1 vs 2 2.021324   0.364

Reasons to use multilevel models

Adapted from McElreath ( ???)

  1. To adjust estimates for repeat sampling. When more than one observation arises from the same individual, location, or time, then traditional, single-level models may mislead us.
  2. To adjust estimates for imbalance in sampling. When some individuals, locations, or times are sampled more than others, we may also be misled by single-level models.
  3. To study variation. If research questions include variation among individuals or other groups within the data, multilevel models help, because they model variation explicitly.
  4. To avoid averaging. Avoid pre-average to construct variables for a regression analysis. Averaging removes variation. Multilevel models allow us to preserve the uncertainty in the original, pre-averaged values, while still using the average to make predictions.

Can poor experimental design be saved by multilevel models?

No.

Quiz 08-3

Watch Lecture 08-4

References